Nasimeh Bahmanian: Missing Verb Illusion

RePsychLing SMLP2022

Author

Reinhold Kliegl

Published

September 2, 2022

1 WORK IN PROGRESS

Script to analyze the number agreement errors made by Spanish native speakers in a web-based pronoun production experiment (Experiment 1: number agreement). Created May 2022, by Sol Lago and Nasimeh Bahmanian.

2 Modeling issues

In the majority of the models, we had to simplify the random-effects structure due to convergence issues. But even in the simplified models, we are facing a singularity issue. I know that we should not simply ignore singularity warnings, but we decided not to simplify our models further. So, in addition to how to resolve convergence issues without oversimplifying a model, I would like to discuss singularity during the summer school and better understand the consequences of ignoring it.
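As a starting point for that discussion, a singular fit can be diagnosed directly with `isSingular()` and inspected with `VarCorr()`. A minimal sketch using lme4's built-in `Dyestuff2` data (not our data), which is known to produce a boundary fit because the between-Batch variance is estimated at zero:

```r
library(lme4)

# Dyestuff2 yields a boundary (singular) fit:
# the Batch variance component is estimated at zero.
fm <- lmer(Yield ~ 1 + (1 | Batch), Dyestuff2)

isSingular(fm)   # TRUE: a variance/correlation parameter is at the boundary
VarCorr(fm)      # shows the zero variance component
```

`rePCA()` (used below) gives a complementary view: random-effects components with near-zero variance signal overparameterization.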

3 Background

In this experiment participants were asked to describe scenes of moving objects by producing sentences (in Spanish) like the English example below:

   *The shield(s)* pipped  *the hat(s)*  below/above    *it/them*
    antecedent              attractor                    pronoun

The antecedent and the attractor either matched or mismatched in number. We are interested in whether responses in mismatch conditions differ from those in match conditions (1) in terms of accuracy, i.e. whether participants made agreement errors in producing the pronoun, and (2) in terms of latency in planning/producing the pronoun in error-free responses. For the latter, we look at the duration of the critical regions as well as the likelihood of a non-zero pause immediately before the critical region.

4 README

Some of the variables

  • ‘TargetResponseIdentical’
  • ‘ResponseComplete’
  • ‘NumberError’
  • ‘GenderError’
  • ‘OtherDifferences’

are only needed to create the two dataframes: accuracy data and latency data.

For the statistical analyses, different sets of variables are needed depending on the analysis:

  • Analysis of errors
    • ‘NumberError’
    • ‘Match’
    • ‘AntecedentNumber’
  • Analysis of pauses
    • ‘Segment’
    • ‘Pause’
    • ‘Match’
    • ‘AntecedentNumber’
  • Analysis of durations
    • ‘Segment’
    • ‘Duration’
    • ‘Match’
    • ‘AntecedentNumber’
    • ‘Structure’
    • ‘Syllable’
    • ‘NounGender’

5 Setup

#library(summarytools)                   # data frame visualization and summary
library(binom)                          # binomial confidence intervals 
library(tidyverse)                      # data manipulation and plotting
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggridges)                       # density plots
library(scales)                         # scales for plots

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(patchwork)                      # concatenate plots
#library(lmerTest)                       # frequentist stats
library(lme4)
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack
library(car)                            # plot residuals
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(MASS)                           # boxcox

Attaching package: 'MASS'

The following object is masked from 'package:patchwork':

    area

The following object is masked from 'package:dplyr':

    select
library(Hmisc)                          # descriptive statistics
Loading required package: lattice
Loading required package: survival
Loading required package: Formula

Attaching package: 'Hmisc'

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units
library(OneR)                           # vincentile

6 Preprocessing

6.1 Trials to exclude from the accuracy analysis.

EXCLUDE_ACCURACY <- c("wrong_word", "wrong_number", 
                      "pronoun_unintelligible", "other")

6.2 Number of vincentiles for vincentile plot.

N_VINCENTILES <- 6    

6.3 Import data

productiondata  <- read.table("./data/results_experiment1_long.csv",
                          sep = ",",
                          stringsAsFactors = TRUE, 
                          header = TRUE, 
                          quote = "'")

productiondata$Item <- factor(productiondata$Item)

6.4 Create data frame for accuracy analysis

accuracydata <- 
  productiondata %>%
  # Keep one row per trial.
  dplyr::select(!("Segment":"Syllable")) %>%
  unique() %>%
  # Exclude incomplete responses.
  filter(ResponseComplete == "yes") %>%
  # Exclude responses with gender errors or distorted preambles.
  filter(!(GenderError %in% c("yes"))) %>%
  filter(!(OtherDifferences %in% EXCLUDE_ACCURACY)) %>%
  # Convert number errors to a numeric variable (0 = correct; 1 = error).
  mutate(NumberError = ifelse(NumberError == "no", 0, 1)) %>%
  # Re-order factor levels.
  mutate(Condition = factor(Condition, levels = c("ss", "sp", "pp", "ps"))) %>%
  mutate(AntecedentNumber = factor(AntecedentNumber, levels = c("singular", "plural"))) %>%
  droplevels()

6.5 Create data frame for latency analysis

latencydata <- productiondata %>%
  
  # Exclude incomplete responses.
  filter(ResponseComplete == "yes") %>%
  
  # Exclude responses that are not identical to the target sentences.
  filter(TargetResponseIdentical == TRUE) %>% 
  
  # Re-order factor levels.
  mutate(Condition = factor(Condition, levels = c("ss", "sp", "pp", "ps"))) %>%
  mutate(AntecedentNumber = factor(AntecedentNumber, levels = c("singular", "plural"))) %>%
  mutate(Segment = factor(Segment, levels = c("onset",
                                              "antecedent",
                                              "verb",
                                              "attractor",
                                              "adverb",
                                              "pronoun",
                                              "attractor_pronoun"))) %>%
  droplevels()

6.6 Outlier exclusions for latency data

Outlier detection uses the interquartile-range (IQR) method described below.

According to this method, outliers are all observations above Q75 + 1.5 IQR or below Q25 - 1.5 IQR (where Q25 and Q75 correspond to the first and third quartile, and IQR is the difference between them).

6.7 Visualize outliers according to the IQR method.

ggplot(latencydata, aes(x = Condition, y = Duration)) + geom_boxplot()

6.8 Compute outlier bounds per segment

outliers_iqr <- latencydata %>% 
  group_by(Segment) %>% 
  summarise(
    Qlow       = quantile(Duration, probs = 0.25),
    Qup        = quantile(Duration, probs = 0.75),
    Iqr        = Qup - Qlow,
    LowerBound = Qlow - 1.5 * Iqr,
    UpperBound = Qup + 1.5 * Iqr
  )

6.9 Replace outliers with NAs in latency data

latencydata <- 
  latencydata  %>% 
  
  # Add exclusion criterion to the latency data.
  left_join(dplyr::select(outliers_iqr, c(Segment, LowerBound, UpperBound))) %>% 
  
  # Revise durations: outliers replaced with NA.
  mutate(Duration = ifelse(Duration < LowerBound | Duration > UpperBound, NA, Duration)) %>%
  
  # Revise pauses: pause is NA when duration is NA.
  mutate(Pause = ifelse(is.na(Duration), NA, Pause)) %>%
  
  # Delete lower- and upper-bound columns.
  dplyr::select(-c(LowerBound, UpperBound))
Joining, by = "Segment"

6.10 Quantify exclusion rates per segment

outlier_exclusions <- 
  latencydata %>%
  
  # Group by segment.
  group_by(Segment, Condition) %>% 
  
  # Compute averages.
  summarise(
    NExcluded       = length(Duration[is.na(Duration)]),
    NTrial          = length(Duration),
    PercentExcluded = NExcluded / NTrial * 100)
`summarise()` has grouped output by 'Segment'. You can override using the
`.groups` argument.

Print to screen.

print(outlier_exclusions, n = 30)
# A tibble: 28 × 5
# Groups:   Segment [7]
   Segment           Condition NExcluded NTrial PercentExcluded
   <fct>             <fct>         <int>  <int>           <dbl>
 1 onset             ss               41   1115           3.68 
 2 onset             sp               27   1015           2.66 
 3 onset             pp               24   1023           2.35 
 4 onset             ps               18    998           1.80 
 5 antecedent        ss               97   1115           8.70 
 6 antecedent        sp              100   1015           9.85 
 7 antecedent        pp               92   1023           8.99 
 8 antecedent        ps               83    998           8.32 
 9 verb              ss               30   1115           2.69 
10 verb              sp               18   1015           1.77 
11 verb              pp               36   1023           3.52 
12 verb              ps               58    998           5.81 
13 attractor         ss               34   1115           3.05 
14 attractor         sp               42   1015           4.14 
15 attractor         pp               30   1023           2.93 
16 attractor         ps               27    998           2.71 
17 adverb            ss               13   1115           1.17 
18 adverb            sp               21   1015           2.07 
19 adverb            pp               15   1023           1.47 
20 adverb            ps               12    998           1.20 
21 pronoun           ss               18   1115           1.61 
22 pronoun           sp               19   1015           1.87 
23 pronoun           pp               10   1023           0.978
24 pronoun           ps               14    998           1.40 
25 attractor_pronoun ss               23   1115           2.06 
26 attractor_pronoun sp               24   1015           2.36 
27 attractor_pronoun pp               24   1023           2.35 
28 attractor_pronoun ps               39    998           3.91 

6.11 Error rates by condition

average_errors <- accuracydata %>%
  group_by(Condition) %>%
  dplyr::summarise(
    .groups = "keep",
    NError = sum(NumberError),
    N      = length(NumberError),
    Mean   = NError / N * 100,
    CIlow  = as.numeric(binom.confint(NError, N, methods = "agresti-coull")["lower"]) * 100,
    CIhigh = as.numeric(binom.confint(NError, N, methods = "agresti-coull")["upper"]) * 100
    )
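For reference, `binom.confint()` returns a one-row data frame per method, which is why the `lower` and `upper` columns are extracted above. A minimal sketch with made-up counts (12 errors in 200 trials):

```r
library(binom)

# Agresti-Coull 95% interval for 12 "successes" out of 200 (made-up numbers).
ci <- binom.confint(x = 12, n = 200, methods = "agresti-coull")

ci$lower   # lower bound of the interval
ci$upper   # upper bound of the interval
```

The interval brackets the observed proportion 12/200 = 0.06, which is the sanity check applied to the error rates above.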

7 Plots

7.1 Condition

jitter         <- position_jitter(width = 0.2, height = 0, seed = 0)
antecedent_col <- c("#3C425A", "#CC9A41")
facet_names    <- c(`experiment1` = "Error rate") 
 
plot_errors <-
  accuracydata %>%
  
  # Compute mean error rates by participant.
  group_by(Experiment, Participant, Condition, AntecedentNumber) %>%
  summarise(.groups = "keep",
            NError = sum(NumberError),
            N      = length(NumberError),
            Mean   = NError / N) %>%
  
  # Add by-condition means to the data frame.
  group_by(Condition) %>%
  mutate(MeanCondition = mean(Mean, na.rm = TRUE)) %>%
  
  # Draw plot.
  {ggplot(., aes(x = Condition, y = Mean, color = AntecedentNumber)) +
      
      # Draw by-participant means.
      geom_point(size = 0.6, alpha = 0.3, stroke = 0, position = jitter) +
      
      # Draw group means and labels.
      stat_summary(fun = mean, geom = "point", shape = 18, size = 2)  +
      geom_label(
        data = unique(dplyr::select(., c(Condition, MeanCondition, AntecedentNumber))),
        aes(x = Condition, y = MeanCondition, label = round(MeanCondition, 3) * 100),
        size = 2.5, nudge_y = 0.035, label.padding = unit(.15, "lines"), 
        label.size = 0.2, alpha = 0.5) +
      
      # Create a facet title.
      facet_wrap(. ~ Experiment, scales = "free", labeller = as_labeller(facet_names)) +
      
      # Customize axes and scales.
      labs(x = NULL, y = NULL) +
      scale_color_manual(values = antecedent_col) +
      scale_x_discrete(labels = c("SS", "SP", "PP", "PS")) +
      scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
      
      # Customize theme.
      theme_light() +
      theme(legend.position = "none") +
      theme(text = element_text(size = 8, colour = "gray28")) + 
      theme(strip.text.x = element_text(size = 8))
  }

plot_errors

# Save plot.
# ggsave("./figures/error_rates.jpeg", plot_errors, units = "cm", height = 5, width = 5, dpi = 300)

7.2 Average durations

average_duration <- 
  latencydata %>%
  
  # Compute averages by condition and segment.
  group_by(Segment, AntecedentNumber, Match) %>%
  dplyr::summarise(
    .groups = "keep",
    N      = length(Duration),
    Mean   = mean(Duration, na.rm = TRUE),
    SD     = sd(Duration, na.rm = TRUE),
    CIlow  = Mean - (qnorm(0.975) * SD / sqrt(N)),
    CIhigh = Mean + (qnorm(0.975) * SD / sqrt(N)),
    )

7.3 Parameters

jitter        <- position_jitter(width = 0.2, height = 0, seed = 0)
dodge         <- position_dodge(0.35)
color_cond    <- c("#3C425A", "#3C425A", "#CC9A41", "#CC9A41")
linetype_cond <- c("solid", "dotdash", "solid", "dotdash")
label_cond    <- toupper(levels(latencydata$Condition))
label_regions <- c("El escudo\nantecedent",
                   "ha pipeado\nverb",
                   "los sombreros\nattractor",
                   expression(~bold("(de) debajo\nadverb")),
                   expression(~bold("de él\npronoun")))

7.4 Sentence regions

plot_duration_all <- 
  latencydata %>%
  
  # Exclude utterance onset.
  filter(!Segment %in% c("onset", "attractor_pronoun")) %>% 
  
  # Separate critical region from other segments.
  mutate(SegmentType = factor("All regions")) %>%
  
  # Compute mean durations by participant.
  group_by(Participant, SegmentType, Segment, Condition, AntecedentNumber, Match) %>%
  summarise(.groups = "keep", Mean = mean(Duration, na.rm = TRUE)) %>%
  
  # Draw plot.
  {ggplot(., aes(x = Segment, y = Mean, group = Condition, color = Condition)) +
      
      # Plot by-condition means and standard error bars.
      stat_summary(fun = mean, geom = "path", aes(linetype = Condition), 
                   size = 0.5, position = dodge, na.rm = TRUE)  +
      stat_summary(fun.data = mean_se, geom = "errorbar", size = 0.5, 
                   width = 0, position = dodge, show.legend = FALSE) +
      stat_summary(fun = mean, geom = "point", shape = 18, size = 2, 
                   position = dodge, show.legend = FALSE)  +
      
      # Place utterance onset and sentence segments in different facets.
      facet_grid( ~ SegmentType) + 
      
      # Customize axes and scales.
      labs(x = NULL, y = "Duration [ms]") +
      scale_x_discrete(labels = label_regions) +
      scale_color_manual(values = color_cond, labels = label_cond) +
      scale_linetype_manual(values = linetype_cond, labels = label_cond) +
      
      # Customize theme.
      theme_light() +
      theme(legend.position = "top") +
      theme(text = element_text(size = 10, colour = "gray28")) +
      theme(axis.text.x = element_text(size = 7, vjust = 0.5, margin = margin(10,0,0,0))) + 
      theme(strip.text.x = element_text(size = 10))
}

7.5 Post-attractor region

plot_duration_critical <- latencydata %>%
  
  # Select only post-attractor combined region.
  filter(Segment == "attractor_pronoun") %>% 
  
  # Separate critical region from other segments.
  mutate(SegmentType = factor("Adverb + pronoun region")) %>%
  
  # Compute mean durations by participant.
  group_by(Participant, SegmentType, Segment, Condition, AntecedentNumber, Match) %>%
  summarise(.groups = "keep", Mean = mean(Duration, na.rm = TRUE)) %>%
  
  # Add by-condition means to the data frame.
  group_by(Condition) %>%
  mutate(MeanCondition = mean(Mean)) %>%
  
  # Draw plot.
  {ggplot(., aes(x = Condition, y = Mean, color = Condition)) +
      
      # Draw boxplots.
      geom_boxplot(size = 0.5, outlier.shape = NA) +
      
      # Draw by-participant means.
      geom_point(size = 1.5, alpha = 0.5, stroke = 0, position = jitter) +
      
      # Draw by-condition means.
      stat_summary(fun = mean, geom ="point", shape = 18, size = 4) +
      geom_label(
        data = unique(dplyr::select(., c(Condition, MeanCondition, AntecedentNumber))),
        aes(x = Condition, y = MeanCondition, label = round(MeanCondition, 0)),
        size = 3.5, nudge_y = 23, label.padding = unit(.15, "lines"), 
        alpha = 0.5, label.size = NA) +
      
      # Facets for each segment.
      facet_grid(. ~ SegmentType) +
      
      # Customize axes and scales.
      labs(x = NULL, y = NULL) +
      scale_color_manual(values = color_cond) +
      scale_x_discrete(labels = label_cond) +
      
      # Customize theme.
      theme_light() +
      theme(legend.position = "none") +
      theme(text = element_text(size = 10, colour = "gray28")) + 
      theme(strip.text.x = element_text(size = 10))
  }

7.6 Concatenate and save

theme_custom <- theme_light() + theme(legend.position = "top")
theme_set(theme_custom)

jpeg("./figures/durations_by_region.jpeg", units = "cm", width = 16, height = 12, res = 300)
(plot_duration_all | plot_duration_critical) + 
  plot_layout(widths = c(1.8, 1)) +
  plot_layout(guides = "collect")
dev.off()

7.7 Likelihood of pauses by condition

average_pauses <- latencydata %>%
  
  # Keep only pause data.
  filter(!is.na(Pause)) %>%
  
  # Compute averages by condition and segment.
  group_by(Segment, Condition) %>%
  dplyr::summarise(
    .groups = "keep",
    NPause = sum(Pause),
    N      = length(Pause),
    Mean   = NPause / N,
    CIlow  = as.numeric(binom.confint(NPause, N, methods = "agresti-coull")["lower"]),
    CIhigh = as.numeric(binom.confint(NPause, N, methods = "agresti-coull")["upper"]))


# Plotting parameters
jitter <- position_jitter(width = 0.2, height = 0, seed = 0)

# Plot
plot_pause <- latencydata %>%
  
  # Keep only pause data.
  filter(!is.na(Pause)) %>%
  
  # Compute mean durations by participant.
  group_by(Participant, Segment, Condition, AntecedentNumber) %>%
  summarise(.groups = "keep",
            NPause = sum(Pause),
            N      = length(Pause),
            Mean   = NPause / N) %>%
  
  # Add by-condition means to the data frame.
  group_by(Segment, Condition) %>%
  mutate(MeanCondition = mean(Mean, na.rm = TRUE)) %>%
  
  # Draw plot.
  {ggplot(., aes(x = Condition, y = Mean, color = AntecedentNumber)) +
      
      # Draw by-participant means.
      geom_point(size = 0.9, alpha = 0.3, stroke = 0, position = jitter) +
      
      # Draw group means and labels.
      stat_summary(fun = mean, geom ="point", shape = 18, size = 2) +
      geom_label(
        data = unique(dplyr::select(., c(Segment, Condition, MeanCondition, AntecedentNumber))),
        aes(x = Condition, y = MeanCondition, label = round(MeanCondition, 3) * 100),
        size = 2, nudge_y = 0.025, label.padding = unit(.15, "lines")) +

      # Facets for each segment.
      facet_wrap(. ~ Segment, ncol = 3, scales = "free") +
      
      # Customize axes and scales.
      labs(x = NULL, y = "Pause likelihood") +
      scale_color_manual(values = c("#3C425A", "#CC9A41")) +
      scale_x_discrete(labels = c("SS", "SP", "PP", "PS")) +
      scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
      
      # Customize theme.
      theme_light() +
      theme(legend.position = "none") +
      theme(text = element_text(size = 8, colour = "gray28")) + 
      theme(strip.text.x = element_text(size = 8))
  }
plot_pause

# Save plot.
# ggsave("./figures/pauses.jpeg", plot_pause, units = "cm", height = 10, width = 12, dpi = 300)

8 Contrasts

  • AntecedentNumber: sum coded (-1 singular, +1 plural).
  • Match: sum coded (-1 match, +1 mismatch)
  • Structure: sum coded (-1 no_preposition, +1 with_preposition)

RK: The recommendation is to stay with digits.

# contrasts for accuracy data
contrasts(accuracydata$AntecedentNumber) <- -contr.sum(2)  
contrasts(accuracydata$Match)            <- -contr.sum(2)  
contrasts(accuracydata$Structure)        <- -contr.sum(2)  

# contrasts for latency data.
contrasts(latencydata$AntecedentNumber) <- -contr.sum(2) 
contrasts(latencydata$Match)            <- -contr.sum(2) 
contrasts(latencydata$Structure)        <- -contr.sum(2) 
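The coding above can be verified on a small factor: with levels ordered (singular, plural), `-contr.sum(2)` assigns -1 to the first level and +1 to the second, as stated in the bullet list.

```r
f <- factor(c("singular", "plural"), levels = c("singular", "plural"))
contrasts(f) <- -contr.sum(2)
contrasts(f)
#          [,1]
# singular   -1
# plural      1
```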

9 Check distribution

RK: I would recommend doing this much earlier – possibly even before the outlier detection, unless the outliers are clearly unreasonable.

The Box-Cox procedure is used to determine the best transformation (adapted from Osborne, 2010):

  • if 0 ≤ λ < 1, log transformation;
  • if -1 < λ < 0, reciprocal;
  • if λ = 1, no transformation needed.

RK: Hm?
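To illustrate the procedure in isolation, here is a sketch on simulated lognormal data (not our data), where the true λ is 0 and the profile maximum from `MASS::boxcox()` should land near 0, pointing to the log transformation:

```r
library(MASS)

set.seed(1)
y <- exp(rnorm(500))   # lognormal data: true lambda is approximately 0

# Profile the Box-Cox likelihood over the default lambda grid without plotting.
bc <- boxcox(lm(y ~ 1), plotit = FALSE)

# Lambda maximizing the profile likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

The same `which.max()` extraction could replace the data-frame detour used in the chunks below; both give the identical λ.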

10 Data transformation

10.1 Adverb + pronoun region

bc <- boxcox(lm(Duration ~ 1, data = filter(latencydata, Segment == "attractor_pronoun")))

bc <- data.frame(cbind(lambda = bc$x, lik = bc$y))
bc <- bc[bc$lik == max(bc$lik), "lambda"]

lambda = 0.51, so the log transformation is used.

RK: Square-root would be an option, too.

10.2 Pronoun region

bc <- boxcox(lm(Duration ~ 1, data = filter(latencydata, Segment == "pronoun")))

bc <- data.frame(cbind(lambda = bc$x, lik = bc$y))
bc <- bc[bc$lik == max(bc$lik), "lambda"]

lambda = 0.83, so the log transformation is used.

RK: I would not transform.

10.3 Onset region

bc <- boxcox(lm(Duration ~ 1, data = filter(latencydata, Segment == "onset")))

bc <- data.frame(cbind(lambda = bc$x, lik = bc$y))
bc <- bc[bc$lik == max(bc$lik), "lambda"]

lambda = 0.47, so the log transformation is used.

RK: Square-root would be an option, too.

RK: I definitely would not use different transformations for different regions. My impression is that staying with the original metric would be ok.

11 (G)LMMs

11.1 GLMM for accuracy: AntecedentNumber * Match

NB: Random effects simplified due to non-convergence. RK:

  • The model is not correctly specified. One problem is that you must not use factors with || in the RES.
  • The specification may cause false-positive convergence errors.
  • AntecedentNumber and Match look like within-Subject/Item factors.
m_accuracy <- glmer(NumberError ~ AntecedentNumber * Match 
                    + (1 + AntecedentNumber + Match | Participant) 
                    + (1 + AntecedentNumber + Match || Item), 
                    family = binomial, data = accuracydata)

Here random effects were simplified due to non-convergence, but we may just as well start with the so-called maximal LMM.

f1 <- "./fits/m_acc.rda"
if(!file.exists(f1)){
m_acc <- glmer(NumberError ~ 1 + AntecedentNumber * Match +
                            (1 + AntecedentNumber * Match | Participant) +
                            (1 + AntecedentNumber * Match | Item),  
                    family = binomial,
                    data = accuracydata,
                    control = glmerControl(calc.derivs=FALSE))
save(m_acc, file=f1)
} else load(f1)

summary(rePCA(m_acc))
$Item
Importance of components:
                         [,1]   [,2]    [,3]    [,4]
Standard deviation     1.3724 0.9706 0.05705 0.02759
Proportion of Variance 0.6657 0.3329 0.00115 0.00027
Cumulative Proportion  0.6657 0.9986 0.99973 1.00000

$Participant
Importance of components:
                         [,1]   [,2]    [,3]    [,4]
Standard deviation     1.1118 0.6247 0.14893 0.02292
Proportion of Variance 0.7496 0.2366 0.01345 0.00032
Cumulative Proportion  0.7496 0.9862 0.99968 1.00000
VarCorr(m_acc)
 Groups      Name                     Std.Dev. Corr                
 Item        (Intercept)              0.80618                      
             AntecedentNumber1        0.87028  -0.524              
             Match1                   0.78221  -0.974  0.338       
             AntecedentNumber1:Match1 0.90023   0.318 -0.971 -0.112
 Participant (Intercept)              0.89646                      
             AntecedentNumber1        0.76636  -0.561              
             Match1                   0.20289  -0.167  0.791       
             AntecedentNumber1:Match1 0.46579   0.443 -0.961 -0.679

RK: This model looks like it is supported by the data. Some of the CPs are of suspiciously large magnitude. We will use MixedModels.jl to check whether they are significantly different from zero.

11.2 GLMM for accuracy: AntecedentNumber / Match

RK: A nested model may be a good a priori or post-hoc alternative LMM. Usually, it makes more sense to use the same formula in the fixed-effects and RES parts of the LMM – if such a complex RES is supported by the data.
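How the nesting expands can be checked on the fixed-effects design matrix. With sum-coded toy factors A and B (standing in for AntecedentNumber and Match), `A / B` is shorthand for `A + A:B`: an intercept, a main effect of A, and a B effect estimated separately within each level of A – which is also where column names like `AntecedentNumbersingular:Match1` in the output below come from.

```r
d <- expand.grid(A = factor(c("singular", "plural"),
                            levels = c("singular", "plural")),
                 B = factor(c("match", "mismatch")))
contrasts(d$A) <- -contr.sum(2)
contrasts(d$B) <- -contr.sum(2)

# A / B expands to A + A:B; the nested interaction uses full dummies for A.
mm <- model.matrix(~ 1 + A / B, d)
colnames(mm)
```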

f2 <- "./fits/m_acc_nested.rda"
if(!file.exists(f2)){
m_acc_nested <- glmer(NumberError ~ 1 + AntecedentNumber / Match +
                                   (1 + AntecedentNumber / Match | Participant) +
                                   (1 + AntecedentNumber / Match | Item),  
                                   family = binomial,
                                   data = accuracydata,
                                   control = glmerControl(calc.derivs=FALSE))
save(m_acc_nested, file=f2)
} else load(f2)

summary(rePCA(m_acc_nested))
$Item
Importance of components:
                         [,1]   [,2]    [,3]    [,4]
Standard deviation     1.4042 0.9526 0.03146 0.01630
Proportion of Variance 0.6845 0.3150 0.00034 0.00009
Cumulative Proportion  0.6845 0.9996 0.99991 1.00000

$Participant
Importance of components:
                         [,1]   [,2]    [,3]    [,4]
Standard deviation     1.0521 0.6150 0.33061 0.02907
Proportion of Variance 0.6939 0.2371 0.06851 0.00053
Cumulative Proportion  0.6939 0.9310 0.99947 1.00000
VarCorr(m_acc_nested)
 Groups      Name                            Std.Dev. Corr                
 Item        (Intercept)                     0.61779                      
             AntecedentNumber1               0.72512  -0.653              
             AntecedentNumbersingular:Match1 1.01865  -0.850  0.953       
             AntecedentNumberplural:Match1   0.96718  -0.211 -0.602 -0.334
 Participant (Intercept)                     0.85080                      
             AntecedentNumber1               0.69106  -0.570              
             AntecedentNumbersingular:Match1 0.43415  -0.221  0.914       
             AntecedentNumberplural:Match1   0.45327   0.449 -0.605 -0.632

We do get a convergence failure for the nested LMM. This strongly suggests that the first LMM may not be as well supported as it looks. We check with MixedModels.jl.

11.3 LMM for duration for adverb + pronoun region: AntecedentNumber * Match

RK: Similar problems with the original LMM.

m_duration_nested <- lmer(log(Duration) ~ 1 + Syllable + AntecedentNumber * Match +
                          (1 + AntecedentNumber + AntecedentNumber:Match | Participant) +
                          (1 + AntecedentNumber:Match || Item),
                          data = filter(latencydata, Segment == "attractor_pronoun"))
summary(m_duration_nested)

Again random effects were simplified due to non-convergence, but again we start with the so-called maximal LMM.

f3 <- "./fits/m_dur.rda"
if(!file.exists(f3)){
m_dur <- lmer(log(Duration) ~ 1 + (Syllable) + AntecedentNumber * Match  +
                             (1 + AntecedentNumber + Match | Participant) +
                             (1 + AntecedentNumber + Match | Item), 
                   data = filter(latencydata, Segment == "attractor_pronoun"),
                   REML=FALSE, control=lmerControl(calc.derivs=FALSE))
save(m_dur, file=f3)
} else load(f3)

summary(rePCA(m_dur))
$Item
Importance of components:
                         [,1]    [,2] [,3]
Standard deviation     0.1034 0.06366    0
Proportion of Variance 0.7250 0.27495    0
Cumulative Proportion  0.7250 1.00000    1

$Participant
Importance of components:
                         [,1]   [,2]    [,3]
Standard deviation     0.8617 0.2979 0.06146
Proportion of Variance 0.8892 0.1062 0.00452
Cumulative Proportion  0.8892 0.9955 1.00000
VarCorr(m_dur)
 Groups      Name              Std.Dev.  Corr         
 Item        (Intercept)       0.0093358              
             AntecedentNumber1 0.0026851 -0.946       
             Match1            0.0063700  0.273  0.053
 Participant (Intercept)       0.0822917              
             AntecedentNumber1 0.0284157 -0.107       
             Match1            0.0080785 -0.491 -0.411
 Residual                      0.0956802              

No convergence errors, but the model is overparameterized in the Item component (the third rePCA component has zero variance).

Plot of main model residuals.

qqPlot(residuals(m_dur))

3378 3132 
3277 3032 

They look fine.

11.4 LMM for duration for adverb + pronoun region: AntecedentNumber / Match

f4 <- "./fits/m_dur_nested.rda"
if(!file.exists(f4)){
m_dur_nested <- lmer(log(Duration) ~ 1 + (Syllable) + AntecedentNumber / Match  +
                                    (1 + AntecedentNumber / Match | Participant) +
                                    (1 + AntecedentNumber / Match | Item), 
                   data = filter(latencydata, Segment == "attractor_pronoun"),
                   REML=FALSE, control=lmerControl(calc.derivs=FALSE))
save(m_dur_nested, file=f4)
} else load(f4)

#summary(rePCA(m_dur_nested))
#VarCorr(m_dur_nested)

Results are similar to the crossed version of the LMM.

11.5 GLMM for pauses before adverb + pronoun region: AntecedentNumber * Match

Original version with random effects simplified due to non-convergence.

f5 <- "./fits/m_pause.rda"
if(!file.exists(f5)){
m_pause <- glmer(Pause ~ 1 + AntecedentNumber * Match +
                        (1 + AntecedentNumber * Match | Participant) +
                        (1 + Match | Item),
                        family = binomial,
                        data = filter(latencydata, Segment == "attractor_pronoun"), 
                        control = glmerControl(calc.derivs=FALSE))
save(m_pause, file=f5)
} else load(f5)

summary(rePCA(m_pause))
$Item
Importance of components:
                         [,1]   [,2]
Standard deviation     0.9072 0.4985
Proportion of Variance 0.7681 0.2319
Cumulative Proportion  0.7681 1.0000

$Participant
Importance of components:
                         [,1]   [,2]    [,3]      [,4]
Standard deviation     1.4022 0.9857 0.43889 0.0008602
Proportion of Variance 0.6281 0.3104 0.06153 0.0000000
Cumulative Proportion  0.6281 0.9385 1.00000 1.0000000
VarCorr(m_pause)
 Groups      Name                     Std.Dev. Corr                
 Item        (Intercept)              0.68552                      
             Match1                   0.77564  -0.526              
 Participant (Intercept)              1.33582                      
             AntecedentNumber1        0.50965   0.104              
             Match1                   0.12568   0.301  0.775       
             AntecedentNumber1:Match1 1.03464  -0.274 -0.508  0.050

11.6 GLMM for pauses before adverb + pronoun region: AntecedentNumber / Match

f6 <- "./fits/m_pause_nested.rda"
if(!file.exists(f6)){
m_pause_nested <- glmer(Pause ~ 1 + AntecedentNumber / Match +
                               (1 + AntecedentNumber / Match | Participant) +
                               (1 + Match | Item),
                               family = binomial,
                       data = filter(latencydata, Segment == "attractor_pronoun"), 
                       control = glmerControl(calc.derivs=FALSE))
save(m_pause_nested, file=f6)
} else load(f6)

summary(rePCA(m_pause_nested))
$Item
Importance of components:
                         [,1]   [,2]
Standard deviation     0.7985 0.6080
Proportion of Variance 0.6331 0.3669
Cumulative Proportion  0.6331 1.0000

$Participant
Importance of components:
                         [,1]   [,2]    [,3]      [,4]
Standard deviation     1.4445 1.2395 0.55961 0.0001121
Proportion of Variance 0.5301 0.3903 0.07956 0.0000000
Cumulative Proportion  0.5301 0.9204 1.00000 1.0000000
VarCorr(m_pause_nested)
 Groups      Name                            Std.Dev. Corr                
 Item        (Intercept)                     0.71426                      
             Match1                          0.70505  -0.266              
 Participant (Intercept)                     1.26221                      
             AntecedentNumber1               0.53220   0.218              
             AntecedentNumbersingular:Match1 0.91762   0.361  0.164       
             AntecedentNumberplural:Match1   1.10352   0.001  0.269 -0.868

11.7 LMM for duration on pronoun region: AntecedentNumber * Match

Random effects simplified due to non-convergence.

m_duration_pronoun <- lmer((log(Duration)) ~ 1 + Syllable + AntecedentNumber * Match + 
                                            (1 + AntecedentNumber + Match | Participant) +
                                            (1 + AntecedentNumber + Match | Item),  
                           data = filter(latencydata, Segment == "pronoun"), REML=FALSE,
                           control = lmerControl(calc.derivs=FALSE))
boundary (singular) fit: see help('isSingular')
summary(m_duration_pronoun)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: (log(Duration)) ~ 1 + Syllable + AntecedentNumber * Match + (1 +  
    AntecedentNumber + Match | Participant) + (1 + AntecedentNumber +  
    Match | Item)
   Data: filter(latencydata, Segment == "pronoun")
Control: lmerControl(calc.derivs = FALSE)

     AIC      BIC   logLik deviance df.resid 
 -2696.8  -2583.1   1366.4  -2732.8     4072 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.1313 -0.5063  0.0382  0.5599  6.0924 

Random effects:
 Groups      Name              Variance  Std.Dev. Corr       
 Item        (Intercept)       4.436e-04 0.021063            
             AntecedentNumber1 6.457e-05 0.008035 -1.00      
             Match1            2.505e-05 0.005005 -0.74  0.74
 Participant (Intercept)       1.916e-02 0.138435            
             AntecedentNumber1 2.893e-03 0.053783 -0.08      
             Match1            1.655e-05 0.004068 -0.64 -0.71
 Residual                      2.748e-02 0.165756            
Number of obs: 4090, groups:  Item, 112; Participant, 47

Fixed effects:
                          Estimate Std. Error t value
(Intercept)               5.019593   0.031980 156.960
Syllable                  0.318073   0.008950  35.540
AntecedentNumber1         0.079392   0.008617   9.213
Match1                    0.005965   0.002714   2.198
AntecedentNumber1:Match1 -0.008244   0.002607  -3.162

Correlation of Fixed Effects:
            (Intr) Syllbl AntcN1 Match1
Syllable    -0.768                     
AntcdntNmb1  0.149 -0.263              
Match1      -0.102  0.010 -0.138       
AntcdnN1:M1  0.004 -0.007  0.012  0.020
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
qqPlot(residuals(m_duration_pronoun))

3378 2812 
3325 2760 

11.8 LMM for duration on pronoun region: AntecedentNumber / Match

Random effects simplified due to non-convergence.

m_duration_pronoun_nested <- lmer((log(Duration)) ~ Syllable +
                                    AntecedentNumber + AntecedentNumber:Match +
                                    (1 + AntecedentNumber + AntecedentNumber:Match | Participant) +
                                    (1 + AntecedentNumber:Match || Item),
                                  data = filter(latencydata, Segment == "pronoun"))
boundary (singular) fit: see help('isSingular')
summary(m_duration_pronoun_nested)
Linear mixed model fit by REML ['lmerMod']
Formula: 
(log(Duration)) ~ Syllable + AntecedentNumber + AntecedentNumber:Match +  
    (1 + AntecedentNumber + AntecedentNumber:Match | Participant) +  
    ((1 | Item) + (0 + AntecedentNumber:Match | Item))
   Data: filter(latencydata, Segment == "pronoun")

REML criterion at convergence: -2799

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.4108 -0.4900  0.0279  0.5345  6.1572 

Random effects:
 Groups      Name                                   Variance  Std.Dev.  Corr             
 Item        AntecedentNumbersingular:Matchmatch    4.494e-04 2.120e-02                  
             AntecedentNumberplural:Matchmatch      1.096e-05 3.311e-03  1.00            
             AntecedentNumbersingular:Matchmismatch 5.223e-04 2.285e-02 -0.22 -0.22      
             AntecedentNumberplural:Matchmismatch   2.339e-06 1.530e-03 -0.99 -0.99  0.08
 Item.1      (Intercept)                            4.370e-12 2.090e-06                  
 Participant (Intercept)                            1.950e-02 1.396e-01                  
             AntecedentNumber1                      2.946e-03 5.428e-02 -0.07            
             AntecedentNumbersingular:Match1        1.782e-03 4.221e-02 -0.17  0.06      
             AntecedentNumberplural:Match1          8.721e-04 2.953e-02  0.11 -0.21 -0.99
 Residual                                           2.645e-02 1.626e-01                  
Number of obs: 4090, groups:  Item, 112; Participant, 47

Fixed effects:
                                 Estimate Std. Error t value
(Intercept)                      5.008490   0.029297 170.958
Syllable                         0.322131   0.007607  42.346
AntecedentNumber1                0.078796   0.008580   9.184
AntecedentNumbersingular:Match1  0.014337   0.007324   1.957
AntecedentNumberplural:Match1   -0.002276   0.005666  -0.402

Correlation of Fixed Effects:
                (Intr) Syllbl AntcN1 AntcdntNmbrs:M1
Syllable        -0.713                              
AntcdntNmb1      0.113 -0.226                       
AntcdntNmbrs:M1 -0.105  0.007  0.037                
AntcdntNmbrp:M1  0.056 -0.001 -0.144 -0.628         
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
qqPlot(residuals(m_duration_pronoun_nested))

3378 2812 
3325 2760 

12 Supplementary analysis: utterance onset latency

Random effects simplified due to non-convergence.

m_duration_onset <- lmer(log(Duration) ~ AntecedentNumber * Match 
                           + (1 + AntecedentNumber + Match | Participant) 
                           + (1 + AntecedentNumber + Match | Item),  
                           data = filter(latencydata, Segment == "onset"))
boundary (singular) fit: see help('isSingular')
summary(m_duration_onset)
Linear mixed model fit by REML ['lmerMod']
Formula: log(Duration) ~ AntecedentNumber * Match + (1 + AntecedentNumber +  
    Match | Participant) + (1 + AntecedentNumber + Match | Item)
   Data: filter(latencydata, Segment == "onset")

REML criterion at convergence: -7026.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.9424 -0.5362 -0.0496  0.5342  4.2238 

Random effects:
 Groups      Name              Variance  Std.Dev. Corr       
 Item        (Intercept)       2.465e-05 0.004964            
             AntecedentNumber1 1.077e-05 0.003282  0.82      
             Match1            3.695e-05 0.006079  0.09 -0.49
 Participant (Intercept)       9.801e-03 0.099001            
             AntecedentNumber1 7.770e-05 0.008815 -0.30      
             Match1            1.478e-05 0.003844  0.25 -0.28
 Residual                      9.550e-03 0.097723            
Number of obs: 4041, groups:  Item, 112; Participant, 47

Fixed effects:
                           Estimate Std. Error t value
(Intercept)               7.4636950  0.0145338 513.540
AntecedentNumber1        -0.0011262  0.0020393  -0.552
Match1                    0.0007659  0.0017422   0.440
AntecedentNumber1:Match1 -0.0001813  0.0015437  -0.117

Correlation of Fixed Effects:
            (Intr) AntcN1 Match1
AntcdntNmb1 -0.181              
Match1       0.084 -0.092       
AntcdnN1:M1 -0.002  0.020  0.018
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
qqPlot(residuals(m_duration_onset))

1803 2137 
1771 2092 

13 Supplementary analysis: effect of optional preposition

13.1 Plot durations by structure (boxplots)

# Plotting parameters
jitter <- position_jitter(width = 0.2, height = 0, seed = 0)

# Plot
plot_duration_boxplots_byStructure <- latencydata %>%

  # Keep only regions after the attractor
  filter(Segment %in% c("adverb", "pronoun", "attractor_pronoun")) %>% 
  
  # Compute mean durations by participant.
  group_by(Participant, Segment, Condition, AntecedentNumber, Structure) %>%
  summarise(.groups = "keep", Mean = mean(Duration, na.rm = TRUE)) %>%
  
  # Draw plot.
  {ggplot(., aes(x = Condition, y = Mean, color = AntecedentNumber)) +
      
      # Draw boxplots.
      geom_boxplot(size = 0.4, outlier.shape = NA, na.rm = TRUE) +
      
      # Draw by-participant means.
      geom_point(size = 0.6, alpha = 0.5, stroke = 0, position = jitter, na.rm = TRUE) +
      
      # Facets for each segment and structure.
      facet_wrap(Segment ~ Structure, ncol = 2, scales = "free") +
      
      # Customize axes and scales.
      labs(x = NULL, y = "Duration [ms]") +
      scale_color_manual(values = c("#3C425A", "#CC9A41")) +
      scale_x_discrete(labels = c("SS", "SP", "PP", "PS")) +
      
      # Customize theme.
      theme_light() +
      theme(legend.position = "none") +
      theme(text = element_text(size = 8, colour = "gray28")) + 
      theme(strip.text.x = element_text(size = 8))
  }
plot_duration_boxplots_byStructure

# Save plot.
#ggsave("durations_critical_boxplots_byStructure.jpeg", 
#       plot_duration_boxplots_byStructure, 
#       units = "cm", height = 15, width = 10, dpi = 300)

13.2 Plot durations by structure (noodleplots)

# Plotting parameters.
jitter        <- position_jitter(width = 0.2, height = 0, seed = 0)
dodge         <- position_dodge(0.35)
color_cond    <- c("#3C425A", "#3C425A", "#CC9A41", "#CC9A41")
shape_cond    <- c(16, 16, 17, 17)
linetype_cond <- c("solid", "dotdash", "solid", "dotdash")
label_cond    <- toupper(levels(latencydata$Condition))
label_regions <- c("adverb","pronoun")

# Plot.
plot_duration_noodle_byStructure <- latencydata %>%
  
  # Exclude utterance onset.
  filter(Segment %in% c("adverb", "pronoun")) %>% 
  
  # Compute mean durations by participant.
  group_by(Participant, Structure, Segment, Condition, AntecedentNumber, Match) %>%
  summarise(.groups = "keep", Mean = mean(Duration, na.rm = TRUE)) %>%
  
  # Draw plot.
  {ggplot(., aes(x = Segment, y = Mean, group = Condition, color = Condition)) +
      
      # Plot by-condition means and standard error bars.
      stat_summary(fun = mean, geom = "path", aes(linetype = Condition), 
                   size = 0.5, position = dodge, na.rm = TRUE)  +
      stat_summary(fun.data = mean_se, geom = "errorbar", size = 0.5, 
                   width = 0, position = dodge, show.legend = FALSE, na.rm = TRUE) +
      stat_summary(fun = mean, geom = "point", aes(shape = Condition), 
                   size = 2, position = dodge, show.legend = FALSE, na.rm = TRUE)  +
      
      # Facets for each segment.
      facet_wrap(. ~ Structure) + 
      
      # Customize axes and scales.
      labs(x = NULL, y = "Duration [ms]") +
      scale_x_discrete(labels = label_regions) +
      scale_color_manual(values = color_cond, labels = label_cond) +
      scale_shape_manual(values = shape_cond, labels = label_cond) +
      scale_linetype_manual(values = linetype_cond, labels = label_cond) +
      
      # Customize theme.
      theme_light() +
      theme(legend.position = "top") +
      theme(text = element_text(size = 10, colour = "gray28")) +
      theme(strip.text.x = element_text(size = 10))
}
plot_duration_noodle_byStructure 

# Save plot.
#ggsave("durations_critical_noodle_byStructure.jpeg", 
#       plot_duration_noodle_byStructure, 
#       units = "cm", height = 7, width = 9, dpi = 300)

14 LMM models

14.1 Adverb + pronoun region: Structure * AntecedentNumber * Match

Random effects simplified due to non-convergence.

# Compute model (random effects simplified due to non-convergence).
m_duration_strucure <- 
  lmer(log(Duration) ~ Structure * AntecedentNumber * Match 
       + (1 + Structure + AntecedentNumber + Match | Participant) 
       + (1 + Structure + AntecedentNumber + Match | Item),  
       data = filter(latencydata, Segment == "attractor_pronoun"))
boundary (singular) fit: see help('isSingular')
# Print main model summary.
summary(m_duration_strucure)
Linear mixed model fit by REML ['lmerMod']
Formula: 
log(Duration) ~ Structure * AntecedentNumber * Match + (1 + Structure +  
    AntecedentNumber + Match | Participant) + (1 + Structure +  
    AntecedentNumber + Match | Item)
   Data: filter(latencydata, Segment == "attractor_pronoun")

REML criterion at convergence: -6868.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.4374 -0.5821 -0.0026  0.5678  5.7559 

Random effects:
 Groups      Name              Variance  Std.Dev. Corr             
 Item        (Intercept)       1.514e-03 0.038909                  
             Structure1        5.880e-05 0.007668 -0.32            
             AntecedentNumber1 9.425e-04 0.030700 -0.99  0.46      
             Match1            4.871e-05 0.006979  0.42  0.15 -0.37
 Participant (Intercept)       5.632e-03 0.075047                  
             Structure1        1.071e-03 0.032723  0.32            
             AntecedentNumber1 8.362e-04 0.028918 -0.12  0.13      
             Match1            7.381e-05 0.008591 -0.30 -0.96 -0.40
 Residual                      8.988e-03 0.094808                  
Number of obs: 4041, groups:  Item, 112; Participant, 47

Fixed effects:
                                      Estimate Std. Error t value
(Intercept)                          6.6320474  0.0122428 541.711
Structure1                           0.0553281  0.0073169   7.562
AntecedentNumber1                    0.0660554  0.0053580  12.328
Match1                               0.0047389  0.0020930   2.264
Structure1:AntecedentNumber1         0.0005687  0.0032162   0.177
Structure1:Match1                    0.0006784  0.0018725   0.362
AntecedentNumber1:Match1            -0.0089477  0.0015345  -5.831
Structure1:AntecedentNumber1:Match1 -0.0020666  0.0015904  -1.299

Correlation of Fixed Effects:
            (Intr) Strct1 AntcN1 Match1 St1:AN1 St1:M1 AN1:M1
Structure1   0.283                                           
AntcdntNmb1 -0.242  0.093                                    
Match1      -0.128 -0.382 -0.249                             
Strctr1:AN1  0.043  0.047  0.057  0.039                      
Strctr1:Mt1 -0.150 -0.126  0.004  0.097 -0.210               
AntcdnN1:M1 -0.005  0.018  0.011  0.023 -0.008   0.023       
Str1:AN1:M1  0.003 -0.042 -0.001  0.017  0.032  -0.030  0.126
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
# Plot main model residuals.
qqPlot(residuals(m_duration_strucure))

3378 2490 
3277 2419 

14.2 Adverb + pronoun region: Structure + AntecedentNumber / Match

m_duration_structure_nested <- 
  lmer(log(Duration) ~ Structure + AntecedentNumber/Match +
         (1 + Structure + AntecedentNumber + AntecedentNumber:Match | Participant) +
         (1 + Structure + AntecedentNumber:Match || Item),
       data = filter(latencydata, Segment == "attractor_pronoun"))
boundary (singular) fit: see help('isSingular')
summary(m_duration_structure_nested)
Linear mixed model fit by REML ['lmerMod']
Formula: 
log(Duration) ~ Structure + AntecedentNumber/Match + (1 + Structure +  
    AntecedentNumber + AntecedentNumber:Match | Participant) +  
    ((1 | Item) + (0 + Structure | Item) + (0 + AntecedentNumber:Match |  
        Item))
   Data: filter(latencydata, Segment == "attractor_pronoun")

REML criterion at convergence: -7068.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.9003 -0.5636  0.0015  0.5636  5.7980 

Random effects:
 Groups      Name                                   Variance  Std.Dev. Corr                   
 Item        AntecedentNumbersingular:Matchmatch    4.673e-03 0.068359                        
             AntecedentNumberplural:Matchmatch      3.842e-05 0.006198  1.00                  
             AntecedentNumbersingular:Matchmismatch 5.427e-03 0.073667  0.93  0.93            
             AntecedentNumberplural:Matchmismatch   1.032e-04 0.010158  0.96  0.96  0.99      
 Item.1      Structureno_preposition                0.000e+00 0.000000                        
             Structurewith_preposition              1.312e-04 0.011454   NaN                  
 Item.2      (Intercept)                            0.000e+00 0.000000                        
 Participant (Intercept)                            5.973e-03 0.077288                        
             Structure1                             7.386e-04 0.027177  0.21                  
             AntecedentNumber1                      8.311e-04 0.028829 -0.11  0.12            
             AntecedentNumbersingular:Match1        9.103e-04 0.030172 -0.30 -0.43 -0.51      
             AntecedentNumberplural:Match1          3.132e-04 0.017697  0.20 -0.05  0.46 -0.88
 Residual                                           8.425e-03 0.091786                        
Number of obs: 4041, groups:  Item, 112; Participant, 47

Fixed effects:
                                 Estimate Std. Error t value
(Intercept)                      6.633383   0.012330 537.999
Structure1                       0.053806   0.006542   8.225
AntecedentNumber1                0.066076   0.005330  12.397
AntecedentNumbersingular:Match1  0.013657   0.005046   2.707
AntecedentNumberplural:Match1   -0.003494   0.003350  -1.043

Correlation of Fixed Effects:
                (Intr) Strct1 AntcN1 AntcdntNmbrs:M1
Structure1       0.190                              
AntcdntNmb1     -0.239  0.057                       
AntcdntNmbrs:M1 -0.221 -0.234 -0.386                
AntcdntNmbrp:M1  0.163 -0.011  0.253 -0.587         
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
qqPlot(residuals(m_duration_structure_nested))

3378 2490 
3277 2419 

15 Supplementary analysis: density plots for durations

15.1 Density plots by region

# Plotting parameters.
color_cond <- c("#3C425A", "#999ba9", "#CC9A41", "#ebcb9f")
fill_cond  <- c("#3C425A", "#999ba9", "#CC9A41", "#ebcb9f")
label_cond <- toupper(levels(latencydata$Condition))

# Plot.
plot_density_by_region <- latencydata %>%
  
  # Compute mean durations by participant.
  group_by(Participant, Segment, Condition, AntecedentNumber, Match) %>%
  
  # Select regions.
  filter(Segment %in% c("attractor", "adverb", "pronoun", "attractor_pronoun")) %>% 
  
  # Make condition labels capitals.
  mutate(Condition = factor(toupper(Condition), levels = c("SS", "SP", "PP", "PS"))) %>%
  
  # Draw plot.
  {ggplot(., aes(x = Duration, y = AntecedentNumber, color = Condition, fill = Condition)) +
      
      # Draw boxplots.
      geom_density_ridges(alpha = 0.6, na.rm = TRUE) +
      
      # Facets for each segment.
      facet_wrap(. ~ Segment, ncol = 2, scales = "free") +
      
      # Customize axes and scales.
      labs(x = "[ms]", y = "Antecedent number") +
      coord_cartesian(ylim = c(1, 4)) +
      scale_fill_manual(values = fill_cond, labels = label_cond) +
      scale_color_manual(values = color_cond, labels = label_cond) +
      guides(fill = guide_legend(override.aes = list(shape = 16, size = 0.5))) + 
      
      # Customize theme.
      theme_light() +
      theme(panel.grid.major = element_blank()) +
      theme(panel.grid.minor = element_blank()) +
      theme(legend.text = element_text(size = 8), legend.position = "top") +
      theme(text = element_text(size = 10, colour = "gray28")) +
      theme(strip.text.x = element_text(size = 10))
  }
plot_density_by_region
Picking joint bandwidth of 17.3
Picking joint bandwidth of 16.3
Picking joint bandwidth of 19
Picking joint bandwidth of 25.7

# Save plot.
#ggsave("density_by_region.jpeg", plot_density_by_region , 
#       units = "cm", height = 10, width = 16, dpi = 300)

15.2 Density plots by structure

# Plotting parameters.
color_cond <- c("#3C425A", "#999ba9", "#CC9A41", "#ebcb9f")
fill_cond  <- c("#3C425A", "#999ba9", "#CC9A41", "#ebcb9f")
label_cond <- toupper(levels(latencydata$Condition))

# Plot.
plot_density_critical_byStructure <- latencydata %>%
  
  # Compute mean durations by participant.
  group_by(Participant, Segment, Condition, AntecedentNumber) %>%
  
  filter(Segment == "attractor_pronoun") %>% 
  
  # Draw plot.
  {ggplot(., aes(x = Duration, color = Condition, fill = Condition)) +
      
      # Draw boxplots.
      geom_density(alpha = 0.6, na.rm = TRUE) +
      
      # Facets for each segment.
      facet_grid(Structure ~ AntecedentNumber, scales = "free", space = "free") +
      
      # Customize axes and scales.
      labs(x = "Duration [ms]") +
      scale_fill_manual(values = fill_cond, labels = label_cond) +
      scale_color_manual(values = color_cond, labels = label_cond) +
      scale_linetype_manual(values = linetype_cond, labels = label_cond) +

      # Customize theme.
      theme_light() +
      theme(legend.position = "top") +
      theme(text = element_text(size = 10, colour = "gray28")) +
      theme(strip.text.x = element_text(size = 10))
  }
plot_density_critical_byStructure

# Save plot.
# ggsave("density_postattractor_byStructure.jpeg", plot_density_critical_byStructure, 
#         units = "cm", height = 12, width = 16, dpi = 300)

16 Supplementary analysis: vincentile plots for durations

16.1 Select participants

Participants with fewer observations than the desired number of vincentiles are excluded on a by-condition basis.

vincentile_exclude <- latencydata %>%
  
  # Select regions of interest.
  filter(Segment %in% c("attractor", "adverb", "pronoun", "attractor_pronoun")) %>% 
  droplevels() %>% 
  
  # Keep only relevant columns.
  dplyr::select(c("Participant", "Condition","Segment","Duration")) %>%  
  distinct(.keep_all = TRUE) %>% 
  
  # Group by participant, condition and region.
  group_by(Segment, Participant, Condition) %>%
  
  # Count number of observations by group.
  summarise(Count = length(Duration), .groups = "drop") %>%
  mutate(KeepForVincentile = factor(ifelse(Count < N_VINCENTILES, "no", "yes"))) 

16.2 Compute vincentiles

Apply exclusions, select regions/factors of interest and exclude NAs.

vincentiles <-
  
  # Do data exclusions.
  left_join(latencydata, vincentile_exclude)  %>%
  filter(KeepForVincentile == "yes")  %>%
  filter(Segment %in% c("attractor", "adverb", "pronoun", "attractor_pronoun")) %>%
  filter(!is.na(Duration)) %>%
  droplevels() %>%

  # Select relevant columns.
  dplyr::select(c("Segment", "Participant", "Item", "Condition", 
                  "AntecedentNumber", "Match", "Duration")) %>%  
  distinct(.keep_all = TRUE) %>% 
  
  # Assign observations to equally-populated bins.
  # [FIXME]: the bins don't seem equally populated!
  group_by(Segment, Participant, Condition, AntecedentNumber, Match) %>% 
  mutate(Bin = bin(Duration, nbins = N_VINCENTILES, method = "content")) %>%
  
  # Compute mean duration per bin.
  group_by(Segment, Participant, Condition, AntecedentNumber, Match, Bin) %>%
  arrange(Segment, Participant, Condition, AntecedentNumber, Match, Duration) %>%
  mutate(BinMean = mean(Duration)) %>%
  
  # Get rid of unused columns.
  dplyr::select(-c("Item", "Duration")) %>%
  distinct(.keep_all = TRUE) %>%
  
  # Sort data frame by bin number.
  group_by(Segment, Participant, Condition, AntecedentNumber, Match) %>%
  mutate(BinNumber = order(BinMean)) %>%
  droplevels()
Joining, by = c("Participant", "Condition", "Segment")
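The `[FIXME]` above notes that `bin(..., method = "content")` does not always return equally populated bins. A plain base-R sketch of vincentile means (sort the durations, cut them into equal-count bins, average per bin) for a single participant-by-condition cell, using simulated data:

```r
# Vincentile means for one cell: sort the durations, cut the sorted values
# into nbins equal-count bins, and average within each bin.
vincentile_means <- function(x, nbins) {
  x <- sort(x[!is.na(x)])
  bins <- cut(seq_along(x), breaks = nbins, labels = FALSE)
  tapply(x, bins, mean)
}

set.seed(1)
dur <- rlnorm(100, meanlog = 6, sdlog = 0.4)  # simulated durations [ms]
v <- vincentile_means(dur, nbins = 5)
v
```

Because binning is done on rank positions rather than on the duration values themselves, the bins are equally populated by construction (up to rounding when the cell size is not a multiple of `nbins`), and the bin means are necessarily increasing.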

16.3 Plot

# Parameters
dodge         <- position_dodge(0.35)
color_cond    <- c("#3C425A", "#999ba9", "#CC9A41", "#ebcb9f")
linetype_cond <- c("solid", "dotdash", "solid", "dotdash")

plot_vincentiles <-
  
  # Draw vincentiles.
  ggplot(vincentiles, aes(x = factor(BinNumber), y = BinMean, 
                          group = Condition, colour = Condition)) +
  stat_summary(fun = mean, geom = "path", aes(linetype = Condition), 
               size = 0.6, position = dodge) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, 
               size = 0.5, position = dodge, show.legend = FALSE) +
  stat_summary(fun = mean, geom = "point", size = 1.5, 
               position = dodge, show.legend = FALSE) +
  
  # Use different facets for each segment.
  facet_wrap(. ~ Segment, ncol = 2, scales = "free") +
  
  # Customize axes and scales.
  labs(x = "Vincentile", y = "[ms]") +
  scale_x_discrete(labels = c(1:N_VINCENTILES)) +
  scale_colour_manual(values = color_cond, labels = label_cond) +
  scale_linetype_manual(values = linetype_cond, labels = label_cond) +
  
  # Customize theme.
  theme_light() +
  theme(text = element_text(size = 10, colour = "gray28")) +
  theme(legend.position = "top") +
  theme(legend.title = element_text(size = 10), legend.text = element_text(size = 10))

plot_vincentiles

# Save plot.
# ggsave("vincentiles.jpeg", plot_vincentiles, units = "cm", 
#       height = 12, width = 16, dpi = 300)

17 RK extensions

17.1 Save dataframes for transfer to Julia

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
dat_acc <- 
  accuracydata |> 
  as_tibble() |> 
  dplyr::select(Subj=Participant, Item, Condition, Strct=Structure, Match, ACN=AntecedentNumber, nerr=NumberError) |> 
  mutate(
    Subj = factor(paste0("S", str_pad(str_remove(Subj, "s"), width = 2, side = "left", pad = "0"))),
    Item = factor(paste0("I", str_pad(Item, width = 3, side = "left", pad = "0"))))
write_feather(dat_acc, "./data/Bahmanian_acc.arrow")
  
dat_dur <- 
  latencydata |> 
  as_tibble() |> 
  dplyr::select(Subj=Participant, Item, Condition, Strct=Structure, Match, ACN=AntecedentNumber, Segment, dur=Duration, pause=Pause) |> 
  mutate(
    Subj = factor(paste0("S", str_pad(str_remove(Subj, "s"), width = 2, side = "left", pad = "0"))),
    Item = factor(paste0("I", str_pad(Item, width = 3, side = "left", pad = "0"))))
write_feather(dat_dur, "./data/Bahmanian_dur.arrow")

17.2 Boxcox

library(arrow)
library(tidyverse)

dat <- read_feather("./data/Bahmanian_dur.arrow")
dat_dur <- dat |> filter(!is.na(dur))

MASS::boxcox(dur ~ 1 + Match + ACN + Subj, data=dat_dur)     # suggests the reciprocal

dat_pronoun <- dat_dur |> filter(Segment == "pronoun") 
MASS::boxcox(dur ~ 1 + Match + ACN + Subj, data=dat_pronoun) # suggests the original scale

dat_onset <- dat_dur |> filter(Segment == "onset")
MASS::boxcox(dur ~ 1 + Match + ACN + Subj, data=dat_onset)   # suggests the log transform

Depending on which segment we look at, the Box-Cox procedure suggests a different transformation, so it is not obvious what to recommend. The reciprocal is out. The pronoun and onset segments may tap into different processes that call for different metrics; however, if at all possible, I would not change the metric across analyses of the same data, because that is very difficult to communicate. The log transform is a reasonable choice, not least because it is usually required for all kinds of fixation measures anyway.
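The profiles above come from the real data; as a self-contained illustration of the procedure, the sketch below (simulated log-normal durations, `MASS` only) shows the likelihood peaking near lambda = 0, i.e. the log transform:

```r
library(MASS)

# Simulated log-normal durations: the Box-Cox log-likelihood should peak
# near lambda = 0, which corresponds to the log transform.
set.seed(1)
dur <- rlnorm(2000, meanlog = 6.6, sdlog = 0.3)  # simulated durations [ms]

# Profile the log-likelihood over a grid of lambda values, without plotting.
bc <- boxcox(dur ~ 1, lambda = seq(-1, 1, 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

For the real data one would replace `dur ~ 1` with the model formula of interest, as done above with `dur ~ 1 + Match + ACN + Subj`.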

18 Appendix

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] arrow_9.0.0     OneR_2.2        Hmisc_4.7-1     Formula_1.2-4  
 [5] survival_3.4-0  lattice_0.20-45 MASS_7.3-58.1   car_3.1-0      
 [9] carData_3.0-5   lme4_1.1-30     Matrix_1.4-1    patchwork_1.1.2
[13] scales_1.2.1    ggridges_0.5.3  forcats_0.5.2   stringr_1.4.1  
[17] dplyr_1.0.9     purrr_0.3.4     readr_2.1.2     tidyr_1.2.0    
[21] tibble_3.1.8    ggplot2_3.3.6   tidyverse_1.3.2 binom_1.1-1.1  

loaded via a namespace (and not attached):
 [1] nlme_3.1-159        fs_1.5.2            bit64_4.0.5        
 [4] lubridate_1.8.0     RColorBrewer_1.1-3  httr_1.4.4         
 [7] tools_4.2.1         backports_1.4.1     utf8_1.2.2         
[10] R6_2.5.1            rpart_4.1.16        DBI_1.1.3          
[13] colorspace_2.0-3    nnet_7.3-17         withr_2.5.0        
[16] gridExtra_2.3       tidyselect_1.1.2    bit_4.0.4          
[19] compiler_4.2.1      cli_3.3.0           rvest_1.0.3        
[22] htmlTable_2.4.1     xml2_1.3.3          labeling_0.4.2     
[25] checkmate_2.1.0     digest_0.6.29       foreign_0.8-82     
[28] minqa_1.2.4         rmarkdown_2.16      base64enc_0.1-3    
[31] jpeg_0.1-9          pkgconfig_2.0.3     htmltools_0.5.3    
[34] dbplyr_2.2.1        fastmap_1.1.0       htmlwidgets_1.5.4  
[37] rlang_1.0.5         readxl_1.4.1        rstudioapi_0.14    
[40] farver_2.1.1        generics_0.1.3      jsonlite_1.8.0     
[43] googlesheets4_1.0.1 magrittr_2.0.3      interp_1.1-3       
[46] Rcpp_1.0.9          munsell_0.5.0       fansi_1.0.3        
[49] abind_1.4-5         lifecycle_1.0.1     stringi_1.7.8      
[52] yaml_2.3.5          plyr_1.8.7          grid_4.2.1         
[55] crayon_1.5.1        deldir_1.0-6        haven_2.5.1        
[58] splines_4.2.1       hms_1.1.2           knitr_1.40         
[61] pillar_1.8.1        boot_1.3-28         reprex_2.0.2       
[64] glue_1.6.2          evaluate_0.16       latticeExtra_0.6-30
[67] data.table_1.14.2   modelr_0.1.9        vctrs_0.4.1        
[70] png_0.1-7           nloptr_2.0.3        tzdb_0.3.0         
[73] cellranger_1.1.0    gtable_0.3.1        assertthat_0.2.1   
[76] xfun_0.32           broom_1.0.1         googledrive_2.0.0  
[79] gargle_1.2.0        cluster_2.1.4       ellipsis_0.3.2